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Abstract 


Statistical models for proteomics data often estimate protein fold changes between 
two samples, A and B, as the average peptide intensity from sample A divided by the 
average peptide intensity from sample B. Such average intensity ratios fail to take full 
advantage of the experimental design which eliminates unseen confounding variables 
by processing peptides from both samples under identical conditions. Typically this 
structure is exploited through the estimation of a protein ratio as the median ratio 
of matched peptide intensities. This simple solution fails to account for a substantial 
missing data bias which has led to the development of more sophisticated average 
intensity models. Here we develop the first statistical model that accounts for non- 
ignorable missingness while utilizing peptide level matched pairs across samples. Our 
simulation analysis shows that models which fail to utilize peptide level ratios, suffer 
astonishing losses to accuracy with basic ANOVA estimates having an average MSE 
371% higher than median ratio estimates. In turn, median ratio estimates have an 
average MSE 35% higher than our model estimates. An analysis of breast cancer 
data reinforces these relationships and shows that our model is capable of increasing 
the number of proteins estimated by 22%. 

Keywords: Label Free Quantitation (LFQ), Stable Isotope Labeling by Amino Acids in Cell 
Culture (SILAC), Skew Normal Distribution, Cibbs Sampler, Matched Pairs, Ionization 
Efficiency, Selection Model 
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1 Introduction 


At the highest level, proteomics is the large scale study of the structure and functions 
of proteins. An important class of studies within this field is shotgun, or discovery, pro¬ 
teomics. These experiments are designed to provide information on a large set of proteins 
that are not specihed before conducting the experiment. Discovery proteomics experiments 
typically necessitate the use of a mass spectrometer which entails an inferential step be¬ 
tween the readings of the mass spectrometer and the protein level outcomes of interest. 
Understanding the details of these experiments becomes essential in order to conduct a 
sensible analysis of proteomics data. A statistician might be fooled by the superficial simi¬ 
larities between microarray and proteomics data into simply adopting microarray methods 
to be used on protein data. Although the outcomes in the experiments share the same 
“intensity” designation, in reality the similarities end with the name. In a microarray ex¬ 
periment, an intensity refers to the observed brightness of a dye that will be present when 
a reaction occurs with some target molecule. The exact relationship between this inten¬ 


sity and the underlying molecular abundance is unclear but, as explained by Dabney and 


Storey (2007), a researcher at least believes the intensity to be a monotone increasing func¬ 


tion of the analyte concentration. Nothing of the sort can be claimed for intensities from 
shotgun proteomics experiments. To justify this statement it will be necessary to provide 
a basic overview of the shotgun proteomics experiment, including a detailed explanation 
of the ionization process and the different sources of missing data. We will show how the 
experimental details create two statistical features characteristic of all shotgun proteomics 
experiments; matched pairs data and non ignorable missingness. After justifying these 
features we propose a model that accounts for them and test the performance of this model 
on both simulated and real data. 


1.1 Bottom Up Relative Quantification Experiments 

Discovery proteomics experiments are usually a type of relative quantihcation experiment 


(Eidhammer et ah, 2008). The implication here is that the experiment can only be used 


to provide measures of protein abundance in one sample relative to another, without ever 
obtaining measures of the absolute abundance in either sample. Many experiments will 
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compare samples from numerous conditions but the simplest scenario is a comparison of 
protein abundances between two samples: Sample A and Sample B. A number of relative 
quantification workflows are available for proteomics to achieve this goal. These include 


Label-Free Quantihcation (LFQ) (Cox et al. 

2014) 

, Stable Isotope Labeling by Amino 

Acids in Cell Culture (SILAC) (Cox and Mann| 

2001 

B) and Isobaric Tags for Relative and 

Absolute Quantihcation (iTRAQ) (Ross et al. 

2004 

. The detailed workhows involved in 


these experiments are important and should determine the types of model that a statis¬ 
tician would consider. For instance the model proposed in this paper works for SILAC 
and LFQ data, but the missing data mechanism we employ is inappropriate for an iTRAQ 
experiment. Nonetheless a full explanation of each experimental workflow is not necessary 
for our purposes. The experiments described here are referred to as bottom-up proteomic 
methods, because proteins are too large to be identihed by mass which forces us to make 
inference about relative protein abundance from measurements obtained on amino acid 
fragments called peptides. A typical bottom-up proteomic workflow involves the extrac¬ 
tion of proteins from cells, tissues or biological secretions/fluids, followed by proteolysis 
which breaks proteins into peptides. Typically this is done by adding an enzyme called 
trypsin that specihcally cleaves proteins at lysine and arginine amino acid residues. After 
this digestion occurs peptides from the sample are separated according to each peptide’s 
hydrophobicity in a process called elution, where the more hydrophobic peptides will be 
the last to separate. After elution, peptides will travel towards an ionization device which 
converts the peptides into ions so that they may enter and be manipulated by a mass 
spectrometer. The ionization is usually done with electricity in the form of electrospray 
ionization (ESI) or with lasers by matrix-assisted laser desorption/ionization (MALDI). It 
is important to emphasize that both of these ionization technologies affect large numbers 
of peptides all at the same time and not all of them will successfully ionize. The elution 
process aims to completely separate each group of peptides but it does not work perfectly. 
When referring to the measurement made on a specihc peptide we may refer to all of the 
other pepties that were ionized at the same time as co-eluting peptides. Co-eluting pep¬ 
tides play an important role in determining the probability of ionization. After ionization 
the newly formed peptide ions are manipulated by a mass spectrometer which is capable of 
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separating the ions according to their mass and connting the nnmber of ions correspond¬ 
ing to each mass. Peptides with a large connts will be selected for fragmentation and a 
second mass spectrometry reading of the fragments. The process of selecting peptides for 
a second mass spectrometry step based on the relative magnitnde of the connts is called 
data-dependent analysis. The second mass spectrometry step is mostly used to identify 
the peptides that were just counted (quantihcation also happens during this step in an 
iTRAQ experiment). A summary of the ion counts for each now identihed peptide, usually 
computed as the area under a curve from a plot of counts through time (Cox and Mann| 


2008), is referred to as a peptide intensity. A more comprehensive description of the LFQ 


workflow can be found in a paper by Sandin et ah (2011). For the purposes of this paper 


we will focus on only the experimental details which motivated our statistical model. 


1.2 Matched Pairs Data 

Mass spectrometers work with ions because advances in technology have given us a tremen¬ 
dous ability to manipulate ions. For this reason ionization of peptide molecules is an indis¬ 
pensable aspect of a mass spectrometry proteomics experiment. Unfortunately, and this is 
a critical point, not all of the peptides from the sample will be ionized. Certain peptides 
tend to ionize more efficiently, while others will not ionize at all. The probability that a 
given peptide molecule will be ionized can be referred to as ionization efficiency. Ionization 
efficiency is believed to be a property of both the chemical structure of each peptide and 
the presence of other co-eluting peptides, sometimes referred to as matrix interferences. 


Schliekelman and Liu (2014) found that competition for charge between background pep¬ 


tides may actually be a more important factor than abundance in determining if a peptide 
will be detected. Regardless of what factors are most important, ionization efficiency can 
cause the proportion of peptides that enter into the mass spectrometer to be drastically 
altered. This is why we previously claimed that peptide intensities are not a monotone 
increasing function of peptide concentrations. One peptide might be far more abundant 
than another in the original sample but a lower ionization efficiency could reverse the re¬ 
lationship for peptide intensities. What we observe is the number of peptides in solution 
multiplied by that peptide’s ionization efficiency. Fortunately, if the efficiency parameter 
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Table 1: This table shows the relationship between relative protein abundance and the 
intensities of a peptide belonging to that protein, p is the probability that the peptide 
ionizes and makes its way into the mass spectrometer. pW and pZ represent the expected 
intensities from samples A and B respectively. 


Protein Abundance Peptide Abundance 
Sample k X W 

Sample B F Z 

Ratio y = h 'y ~ Y ~ ^ 


Ion Abundance 
pW 
pZ 


pW 

pZ 




is a property of the individual peptide, it will cancel out when put into a ratio with the 
same peptide from the other sample. This relationship is outlined in Table In a SILAC 
experiment every peptide in both samples will be processed at the same time and thus 
will be exposed to the same conditions yielding identical ionization efficiencies. However, 
even in a SILAC replicate the efficiency may be altered due to variations in sample prepa¬ 
ration and elution time resulting in a different profile for the background peptides. For 
a Label-Free experiment, run to run variation should always be expected. So unless the 
researchers have good reason to assume the ionization efficiencies will be equivalent, the 
outcomes may be confounded by run to run variation affecting each peptide differently. 
The incomplete and inconsistent ionization of analytes makes it impossible to accurately 
measure the abundance of proteins from the original sample. However in ratio form we 
can still make inference about the relative abundance of proteins in two samples. This 
is why proteomics experiments are often referred to as relative quantification experiments 


and it is why popular proteomics software packages, such as MaxQuant (Cox and Mann 


2008) estimate log protein ratios as the median of log peptide ratios. Similar methods 


dominate the techniques discussed by Eidhammer et ah (2008). Despite the ubiquity of 
median ratio estimates in proteomics software packages, none of the models we found in 
the literature make use of peptide level ratios beyond including peptide as a covariate in a 
linear model. Instead, most statistical methods estimate the log protein ratio from Sample 
A to B by taking the average log intensity across all peptides within the protein from Sam- 
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pie A and then snbtracting the average log intensity from all peptides within the protein 
from sample B. In the absence of missing data these methods are almost identical since 
E{X — Y) = E{X) — E{Y). However, in the presence of wide-spread missing data many 
peptides are detected in only one of the two samples. This makes it difficult to interpret the 
results from a method based on average intensities. Unfortunately wide-spread missingness 
is unavoidable in a discovery mass spectrometry experiment as explained below. 


1.3 Intensity-Dependent Missingness 

Unlike microarray experiments in which missing values often comprise about 1-11% of the 


data (de Brevern et ah, 2004), proteomics data sets almost always have a much higher 
percentage of missing data. In the dataset analyzed in this paper 25% of the peptides were 


missing and according to Karpievitch et al. (2009), 50% missing values are not uncommon. 
For this reason, the way we conceptualize and treat the missing data will take on huge 
importance. Here we will briefly discuss some of the largest causes of missing data. 


1.3.1 Detection Limit 

Mass spectrometers have both theoretical and a practical limits of detection (LOD). The 
theoretical LOD is the minimum number of ions a given instrument can capture and pro¬ 
duce an ion current with adequate signal enhancement. Although any peptide exceeding 
this number of ions can theoretically be detected by the mass spectrometer every sam¬ 
ple contains far more than one peptide which results in a considerable amount of noise. 
This noise results in a practical detection limit, which is dependent on the sample itself, 
whereby the software fails to distinguish peptide peaks from background noise. For this 
reason, sample-related factors that either result in a higher practical detection limit or a 
decreased intensity due to the nature of the sample can result in missing values. As pre¬ 
viously discussed, a major driver in this setting is the peptide ionization efficiency. If the 
ionization efficiency is low then the intensity will be low and may fall below the detection 
limit. This is clearly a form of non-ignorable missingness where the probability of being 
missing is directly related to the magnitude of the intensity. 
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1.3.2 Data-Dependent Tandem Mass Spectrometry 

Most mass spectrometers performing data-dependent analysis (DDA) do not succeed in 
fragmenting every ionized peptide. Peptides are mass selected (isolated) for a tandem MS 
(MS/MS) step according to their intensity rank order. This tandem mass spectrometry step 
is where peptide identification occurs, so any peptide signal that is not selected for tandem 
mass spectrometry will not yield useful data. On a side note, in an iTRAQ experiment 
quantification also occurs in the tandem step whereas in LFQ and SILAC experiments 
quantification occurs during the first MS step. This is why our proposed missing data 
mechanism does not apply to iTRAQ experiments. The data that we analyze in this paper 
was generated from a Q Exactive mass spectrometer made by Thermo Scientific, which 
was only capable of capturing approximately 80% of the peak intensities above the LOD. 
Thus even above the practical LOD an intensity dependent process can result in missing 
values. The number of detectable (identifiable) features can be increased with advances in 
the operating speed of the device, and complete parallelization can be achieved in newer 


Orbitrap instruments (Lesur and Domon, 2015). However, this results in a trade off where 


the identihcation process becomes less certain. In most settings we will have to consider 
two sources of non ignorable missingness, one which occurs below a frequently changing 
detection limit and another above. 


1.3.3 Misidentified/Unmatchable/Razor Peptides 

A peptide might appear in one sample and not in another simply because it was misiden- 
tihed. Matching algorithms are designed to minimize this problem but it undoubtedly still 
exists. A similar problem comes from razor peptides. These are peptides that are properly 
identihed but that could belong to more than one protein. Many software programs assign 
razor peptides to the protein that has the most peptide level evidence based on the Occam’s 


Razor concept of protein parsimony (Cox and Mann, 2008). Yet this process could result 


in misidentihcation and consequently, missing values. It is also possible that a particular 
peptide will simply fail to be identihed with any certainty which will also result in missing 
values. It is probably safe to classify missingness caused by classihcation errors as missing 
at random. 
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1.3.4 Modeling Missingness 

Many efforts have been made to correct for missing data biases in mass spectrometry 


experiments. A review of missing data techniqnes in proteomics by Taylor et al. (2013) 
compared three methods for removing missing data bias: an accelerated failnre time (AFT) 


model by Tekwe et al. (2012), a mixtnre model proposed by Karpievitch et al. (2009) 


and K-Nearest Neighbors (KNN) impntation as described by Troyanskaya et al. (2001). 
Notable additions to this gronp inclnde three Bayesian methods. One method proposed by 
Lno et aT| (2009) models missing probabilities with a logit fnnction. The Bayesian model 


of Lncas et al. (2012) attempts to acconnt for missingness cansed by misidentihcation 


and Koopmans et al. (2014) proposes a model which allows for a random detection limit. 
However, none of these methods ntilize peptide level ratios in their solntions which leaves 
room for improvement. On their merits as pnrely missing data techniqnes, only the model 


by Lno et al. (2009) theoretically acconnts for all the sonrces of missingness described 
above. The AFT model and the mixtnre model both assnme a hxed detection limit. Bnt we 
shonld expect this detection limit to vary from peptide to peptide depending on what other 
compounds are being processed in the background. Furthermore, both of them assume that 
any missingness above this theoretical detection limit should be missing at random with 
the mixture model explicitly categorizing all missingness as either at random or due to 
a detection limit. The use of any technology which uses Data Dependent Analysis will 
make this assumption false since data dependent analysis is a form of intensity dependent 


missingness that occurs above the detection limit. The interesting effort by Koopmans 


et al. (2014) which attempts to model a random detection limit analyzes data at the 


protein level. In other words, the estimation within an experiment has already occurred 
and it is unclear what missing data bias can be corrected. As for the KNN approach, it 
is difficult to imagine that there even could be a theoretical justihcation for imputing 40% 
of a dataset and then proceeding with an analysis as though the observations were real. 
Nonetheless, this option has made its way into software packages such as Inferno (Formerly 


called Dante) described by Polpitiya et al. (2008), so we will examine its efficacy later. The 
problem of ascertaining the cause of a missing value is probably intractable. However, we 
can say with some conhdence that the probability of an intensity being missing should be a 
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monotone increasing function of the intensity. Although the exact conditions and sources 
of missingness will vary from peptide to peptide and experiment to experiment a single 
monotone parametrized function of the probability of missingness could serve as a useful 
approximation for the conglomeration of missing data sources. For this reason, along with 
some mathematical niceties, we model a probit missing data mechanism such that for each 
peptide the probability of being observed is given by <h(a + hy) where <h is the standard 
normal CDF, y is the peptide intensity, and a and b are missingness parameters to be 
estimated in our analysis. 

2 Methods 

In order to symmetrically model log peptide intensities, for the rest of the paper we will 
refer to intensities only on the log scale, for each peptide we frame the problem in terms 
of the protein fold change (the difference between the intensities) and the midpoint of the 
two peptides, 

where Yij^k is the intensity of the jth peptide within the fth protein from the kth sample, 
k = 1,2, i = 1,... ,n indexes the unique proteins in the samples and j = 1 ,...,m* indexes 
the peptides within the fth protein, 0 :^( 4 ) represents the midpoint of the two intensities of 
peptide j within protein i and /i* represents the protein fold change. The notation N{I3, a) 
denotes a normal random variable with mean f5 and variance a. The mixed model definition 
is completed with ~ N{l3a,0 independent from /Xj ~ N{I3^,t). 

This midpoint mixed model (M3) provides a symmetric framework for analyzing a 
proteomics experiment in one statistical model while accounting for the matched pairs 
nature of the data. The model can easily be £t using standard software such as PROC 
MIXED in SAS or Ime from the NLME package in R. We expect this model to provide 
similar results to standard ratio-based methods and improved downstream analysis by 
creating a single estimate of experimental variance. In fact, if we used fixed effects in 
place of random effects M3 is a reparameterization of a linear model with covariates for 
peptide within protein, sample, and a sample*protein interaction. Of course this model 
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completely fails to account for missing data bias. We expand the M3 model into a selection 
model with a probit missingness mechanism. We refer to this new model as the midpoint 
mixed model with a missingness mechanism (M5). Let /() be an indicator function so that 
Rijk = liXijk is observed). We assume ~ Bernoulli{^{a + bYijk)). 


Fitting this model is greatly complicated by the number of missing values in a proteomics 
experiment. In our dataset, there were certain proteins with over 200 missing values, and 
integrating the likelihood 200 times created computational difficulties. We resolved this 
issue by giving each parameter a non-informative prior and using a Gibbs Sampler. The 
Gibbs Sampler required three sampling steps that were non-standard: the distribution 
of a missing value given everything else where Yijk' is the matched pair 

corresponding to the missing value and 9 is the vector of parameters (a, b, r, a, (3a, (3^)] the 
distribution of a protein fold change given everything else /(/ii|ai,Y,e,R); and the distribution 
of a midpoint given everything else A bit of manipulation reveals that the 

distribution of a missing value follows the Extended Skew Normal distribution as described 


by Azzalini and Gapitanio (2014). 




where 


(i){^)^{-a-bx) 
y/a^{u:) 

hx = «iP) + (-l) y, U=-J===. 


We also find that 


{fii\ai,Y,9,R) 


N 


+ i Ej iviji - Vij-i) 


(JT 


a+ ^ 


-- I rrijT 

a + ^ 


and 


(ajjl/ij, Y,6',R) ~ N 


f (^a<^ + + yij2) 

V cr + 2^ 


—)■ 

cr -I- 2^/ 


Proofs of these results can be found in Appendix Although these formulas are complex 
the results are somewhat intuitive. Each missing value from a pair of points comes from 
a skew normal distribution where the skew is determined by the missing data mechanism. 
The fold change comes from a distribution centered around a weighted average of the mean 
protein fold change and the average of the observed differences in peptide intensities. The 
peptide midpoint is drawn from a distribution centered around a weighted average of the 
mean peptide midpoint and the observed midpoint of the pair of intensities. 
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Table 2: The prior and posterior distributions used to complete the model. The parameters 
with non standard prior and posterior distributions are described in the text. 


Parameter 

Prior 

Posterior 

T 

InverseGamma{.001, .001) 

InverseGamma{.001 + n/2, .001 -|- 

e 

InverseGamma{.001, .001) 

/?werseGamma(.001 -|- ^ mi/2, .001 -1- 

a 

InverseGamma{.001, .001) 

JnnerseGamma(.001 ^ 2 * mj/2, .001 -|- ^^) 


Ar(0,10000) 

( 10000 + )) 


Ar(0,10000) 

A'(EA./r,(iski + P')) 

a 

Ar(0,10000) 

Probit Regression Estimation 

b 

Ar(0,10000) 

Probit Regression Estimation 


The Bayesian model formulation is completed with the priors in Table The posterior 
distribution of (a, b) is estimated by htting the probit regression model 

•h (£/ll/jjfc]) o T byijk 


The posterior distribution is then approximated as 



Where a, b and S are the parameter estimates from the probit regression and their cor¬ 
responding covariance estimate, respectively. The bivariate normal distribution used here 
approximates the posterior distribution as a consequence of Bayesian large sample theory 

chap. 4). 


(Gelman et ah, 2004 


2.1 Simulation Study 

To explore the potential benehts of the M5 model, we conduct simulations to compare the 
accuracy of our estimates to six other estimation procedures. In addition to the M3 and 
M5 models, we analyze the commonly used method of median ratios, a one-way ANOVA 
for protein within sample, QRollup and QRollup performed after implementing a weighted 
K-Nearest Neighbors imputation (KNNQ). QRollup is a method of analyzing average in¬ 
tensities that seeks to avoid missing data bias by analyzing only the largest 66% of the 
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intensities within each protein/sample combination. A software package called Inferno im¬ 


plements the QRollup method(Polpitiya et ah, 2008). Similar approaches are discussed 


by Eidhammer et ah (2008). The Inferno software also offers an option to use Weighted 
K-Nearest Neighbors Imputation, which is why we coupled imputation with the QRollup 


method. The ANOVA model is intentionally simpler than that described by Oberg and 


Mahoney (2012), as we wanted to use an example of a model where proteins are estimated 


as the average intensity within groups regardless of the presence of a matched pair. Lin¬ 
ear models that include peptide as a covariate should perform similarly to the M3 model. 
This set of methods gives us a look at the performance of three methods based on ratios 
and three methods based on average intensities. All methods, except for M3 and M5, are 
currently supported by proteomic software packages. 

In our simulation the hyperparameter values were 


r = 9, ^ = 4, (T = .3, a = —9, b = .5, /3q = 18.5, = 0. 

We generated 500 protein fold changes from a r) distribution and generated the num¬ 

ber of constituent peptides within each protein by sampling with replacement from the set 
{1,..., 12}. For each peptide we generated independent random midpoints from a N[f3a, 
distribution. We generated independent residual errors, eijk as A^(0, a) random variables. 
Then we created intensities yijk = aj(i) + + ^ijk- Next we simulated missingness 

by computing probabilities of missingness for each intensity as Pijk = $(a -|- bi/ijk), then 
we randomly drew Bernoulli random variables, {Rijk), according to those probabilities, to 
identify which i/ijk are missing. We then £t all six models including 1,000 draws from the 
M5 Gibbs Sampler to create M5 estimates. Results were recorded and the whole process 
was repeated 100 times. 


2.2 Protein Categories 

Before comparing the six methods, some classification of observation patterns is needed 
since not all methods are capable of estimating the same proteins. To this end we classify 
each protein as “matched”, “unmatched”, “one-sided” or “missing”. Figure presents a 
visual depiction. 
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Figure 1: The three categories of proteins. Matched proteins contain at least one matched 
peptide pair. Unmatched proteins contain data from both conditions but no matched pairs. 
One-sided proteins contain peptide measurements from only one sample. 

Missing proteins are proteins for which peptides were identified but no peptide intensi¬ 
ties were observed. These are not interesting and even though they can be estimated with 
M3, M5 or KNNQ we recommend just removing them from the study. Matched proteins 
are proteins that have at least one matched peptide pair. With at least one shared peptide 
from each sample, all of the methods can be used for estimation. An unmatched protein has 
observed intensities from each sample but no peptides that are quantified in both samples. 
One-Sided proteins have intensities from peptides in only one sample and are completely 
missing in the other. This can be indicative of a large fold change difference. M5, M3 and 
KNNQ can be used to estimate all types of proteins. The ANOVA model and QRollup can 
be used for both matched and unmatched proteins while the method of median ratios can 
only be used on matched proteins. 

2.3 Simulation Results 

The sampling chains all appeared to achieve stationary distributions after about 300 draws 
(in the real data this was achieved within 50). For this reason, our estimates were based 
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Matched Protein MSE across 100 Simulations 



Figure 2: MSE for each method computed across matched Proteins and within each simu¬ 
lation. 

on the posterior mean after a burn in length of 500 draws. 

Figure shows the distribution of mean squared errors across simulations. The most 
obvious result here is that the methods based on ratios are far outperforming methods 
based on average intensities. M5 demonstrates the best performance with an average MSE 
of 0.26. The method of medians was the second most accurate with an MSE of 0.35, 
which represents an increase in error of 35%. This is a fairly large increase but it is hardly 
noticeable relative to the error coming from the average intensity methods. The best of 
these was QRollup with an average MSE of 1, which represents an increase of 285%. It 
should be noted that the commonly used validation tool of correlation does not do a very 
good job of assessing algorithmic weaknesses here. Figure shows that even though some 
of these methods more than sextuple mean squared error, the lowest correlation coefficient 
is still above 0.9. It should also be noted that the use of Weighted K-Nearest Neighbors 
appears to be detrimental to the accuracy of QRollup estimates. 

These relationships can be further explored by categorizing proteins according to the 
percentage of peptides which are missing as shown in Figure This plot shows that the 
error for the KNNQ method increases substantially once more than 50% of the data requires 
imputation. In this chart we can see that, as missingness increases, the ANOVA estimation 
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Figure 3: Scatterplot of true simulated fold changes for matched Proteins vs their estimates 
across all simulations. Correlation coefficients are also computed across all simulations. 

also loses accuracy at a much faster rate than the other methods. This is likely because 
the ANOVA model simply reports average intensities within each sample regardless of the 
amount of missing data. 

In the case of one-sided and unmatched Proteins the method of medians is obviously not 
applicable. Among the other methods the rank ordering based on average MSE remains 
the same, (see Figure]^. 

In this case the average MSE for M5 is 1.5 and the second best is the M3 model at 2.4. 
The best average intensity method was the ANOVA model with an MSE of 3. Correlation 
coefficients are much weaker in this category as pictured in Figure]^ 

Arguably the greatest advantage to using the M5 model comes from the ability to 
estimate One-Sided proteins. These proteins are difficult to estimate since one of the 
samples provides no observed values. Keep in mind for a One-Sided protein that we could 
assume that the abundance of the missing value is between zero and the detection limit. 
However, the upper bound on an abundance ratio is inhnite. Nonetheless, M5 does provide 
decent estimates in this situation as can be seen in Figure Only three methods were 
capable of estimating one-sided proteins and only one of them could be considered useful. 
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Proportion of Missing Peptides 


Figure 4: MSE for each method according to categories of the percentage of missing pep¬ 
tides. The MSE for KNNQ at 75% missing data is 12.81 which was too extreme to be 
plotted along with the other methods. 


Unmatched Protein MSE across 100 Simulations 



Figure 5: MSE for each method computed across unmatched proteins and within each 
simulation. The method of medians, MED, is not applicable to unmatched proteins. 
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Correlation - 0.78 , Average MSE = 3 



true value 


true value 


Figure 6: Scatterplot of true simulated fold changes for unmatched Proteins vs their esti¬ 
mates across all simulations. Correlation coefficients are computed across all simulations. 


One-Sided Protein MSE across 100 Simuiations 


LiJ 

C« 



M5 


M3 KNNQ 


Figure 7: MSE for each method computed across One-Sided proteins and within each 
simulation. The method of medians (MED), ANOVA, and QRollup (Q) are not applicable 
to One-Sided proteins 
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Figure 8: Scatter plot of true simulated fold changes for one-sided proteins vs their estimates 
across all simulations. Correlation coefficients are computed across all simulations. 

The MSEs for M5, M3 and KNNQ were respectively 2.7, 8.6 and 27. The range of the 
log-scale fold changes in this simulation were roughly -10 to 10. So an average MSE for 
M5 of 2.7 is certainly small enough for the estimates to be of interest. The scatter plot in 
Figure 1^ strongly highlights the advantages of the M5 model. 

3 Breast Cancer Data 

In order to make sure the results of our simulation study are not artifacts of the data 
generation procedure, we also analyzed the effect of non-informative missingness on a real 
data set. The data, generated by the Chen Biochemistry Lab, contains peptide level LFQ 
measurements from two samples of breast cancer tissue (one Basal and one Luminal). This 
dataset can be found in the supplementary material. 11,866 unique proteins were identihed 
in the data, of these 594 were Missing, 9,265 had at least one peptide pair, 1,810 were one¬ 
sided and 197 had intensities in both samples but no matched pairs. This breakdown is 
pictured in Figure]^ 

Not considering missing proteins, we can see that before we even do an analysis the 


19 





Breakdown of Proteins in the WHiM2/WHiM16 Data 
M5 Enabies Estimation of 22 % More Proteins 
than the Method of Medians 



Figure 9: Distribution of proteins in the tumor data. 


M5 model is capable of estimating an additional 2,007 (22%) proteins compared with the 
method of medians. This would be a substantial gain if our method is capable of estimating 
those proteins with a decent level of accuracy as our simulations suggest. Furthermore, the 
entire data set contains information on 248,342 peptides, 61,418 (about 25%) of which are 
missing. There is a tremendous amount of information in the patterns of those 61,418 
missing values, and in theory the M5 model takes full advantage of them. M5 model 
estimates were computed on a random set of 1,000 proteins and 95% credible regions were 
computed. 1,000 draws from the gibbs sampler were used with a burn in length of 500. We 
reduced the data size purely for computational simplicity. From the 1,000 proteins, 252 
matched Proteins and 4 one-sided Proteins did not contain zero in their credible intervals 


(Figure 10). Among the 4 one-sided Proteins is protein 075363 which is better known 
as the gene product for the Novel Amplihed in Breast Cancer-1 gene (NABCl). This 
gene is known to be involved in cancer typically being upregulated in breast cancers and 


downregulated in colon cancer (Beardsley et ah, 2003). Since this protein was one-sided 


in our dataset no useful information regarding NABCl would have been found without 
the M5 model. After estimation, we computed the square of the M5 estimates divided 
by the posterior standard deviation. The proteins were then ranked in descending order 
and are presented in Table We also used the real data to study the effects of non- 
ignorable missingness on each of the six estimation methods. To accomplish this goal we 
hrst reduce the data to allow a complete case analysis, so that only peptides with observed 
intensities from both samples are included in the reduced dataset. From this complete-case 


20 







Table 3: Twenty proteins ordered by the highest squared ratio of posterior mean to posterior 
standard error. At the bottom of the table is the complete set of one-sided proteins for 
which the credible region did not include zero. 


Protein 

Estimate 

Posterior SD 

Category 

P48681 

-5.39 

0.300 

Matched 

095425 

-3.72 

0.230 

Matched 

076070 

5.31 

0.420 

Matched 

Q13557 

-2.64 

0.231 

Matched 

F8VTL3 

-1.89 

0.169 

Matched 

Q07065 

-2.33 

0.211 

Matched 

Q13813 

-0.97 

0.096 

Matched 

P12270 

-1.12 

0.112 

Matched 

G3XAI2 

-2.26 

0.241 

Matched 

Q9Y4L1 

-1.42 

0.163 

Matched 

P97457 

4.653 

0.597 

Matched 

Q86SF2 

3.39 

0.436 

Matched 

060231 

-1.93 

0.25 

Matched 

Q13363 

3.58 

0.501 

Matched 

Q9NX62 

2.16 

0.302 

Matched 

Q5T6V5 

3.02 

0.426 

Matched 

Q86WJ1 

-2.70 

0.386 

Matched 

P23786 

1.77 

0.261 

Matched 

Q6BCY4 

-2.55 

0.376 

Matched 

Q9NP74 

-2.37 

0.357 

Matched 

P12109 

-3.59 

0.813 

One-sided 

Q16666 

1.76 

0.823 

One-sided 

B4E1Z4 

2.55 

0.817 

One-sided 

075363 

2.90 

0.944 

One-sided 


21 




Ordered Proteins 


Figure 10: M5 estimates of the log fold change between proteins found in Basal and Luminal 
breast cancer tissues. Only proteins with 95% credible intervals that do not contain zero 
are pictured. 

data, 500 proteins were randomly selected for a sensitivity analysis. The mean peptide 
ratio within each protein was calculated and considered to be the reference value. We 
explored what happens to the estimates from each model as higher levels of intensity- 
dependent missingness are introduced. Appropriate values of the missingness parameter b 
were discovered by trial and error to provide overall missingness levels of 1, 5, 10, 20, 30, 
40 and 50 percent. Mean squared error was then calculated for each of the six methods on 
all 7 datasets. 

3.1 Results of the Sensitivity Analysis 

We explored the effect of intensity dependent missingness on complete case estimates. The 
performance demonstrated similar patterns to what we found in the simulation analysis 
with ratio based methods having far more stable estimates than the average intensity 
based methods, shown in Figure [TTj 

These plots paint a picture consistent with the results from the simulation study. The 
M5 model outperforms all other methods. The method of median ratios and the M3 
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1% 5% 10% 20% 30% 40% 50% 


Percent Missing 


Figure 11: MSE computed as the average squared difference between the estimate from a 
complete case analysis and the estimate from a dataset with simulated intensity dependent 
missingness. MSE for QRollup with weighted KNN imputation takes the values 2.34 and 
3.31 at 40 and 50 percent missingness respectively. These values were too extreme to be 
plotted with the other methods. 
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model have very similar performance. The ANOVA model and QRollup methods perform 
comparably to the ratio-based methods until the missingness is increased to around 10%. 
Once missingness hits 40%, the difference in frameworks becomes substantial, and at 50% 
the average MSE from KNNQ is about eleven times higher than that from the M5 model. 


4 Misspecification of the Missing Data Mechanism 


An obvious artihcial strength of the simulation study is the use of the same missing data 
mechanism in both the simulation and the analysis. The scientihc process supports the 
use of a missing data mechanism in which the probability of a peptide being observed is 
a monotone increasing function of the intensity. Beyond this basic structure very little 
evidence exists to suggest the proper shape of this curve. Our probit model fits the mono¬ 
tonicity requirement however it is not unique in doing so. To examine the robustness to 
misspecihcation of the missing data mechanism we compare estimation results from three 
different missing data mechanisms; a linear model within a probit function, a quadratic 
model within a probit function and a linear model within a logit function. Data was 
simulated 100 times and for each data set a different set of missing values was simulated 
according to the three different models. Simulation parameters were selected so that the 


overall percentage of missing values would be near 33%. As pictured in Figure 12, the 
results suggest that the M5 model is fairly robust to misspecihcation of the missing data 
mechanism. Amongst matched proteins the worst case scenario occurred when the real 
mechanism was a logit model. In this case the average MSE increased by 8% from .26 
to .28, which is still 20% lower than the MSE for the method of medians found in the 
simulation study. Misspecihcation from a quadratic model actually reduced the average 
MSE by 10%. These results seem to suggest that sharper the increase in the probability 
of observing a peptide the better our model will perform. For one-sided and unmatched 
proteins the ehects of misspecihcation were more pronounced. For one-sided proteins we 
observed a 42% increase in average MSE with a probit misspecihcation and a 34% decrease 
for the quadratic model. For unmatched proteins these changes were 43% and -53% respec¬ 
tively. The increased ehect of the missing data mechanism for these proteins should not be 
surprising since the mechanism plays a larger role in the estimation when no matched pairs 
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Matched Protein MSE 


One Sided Protein MSE 


Unmatched Protein MSE 



I -1-1 I-1-1 I- 1-1 

Linear Quadratic Logit Linear Quadratic Logit Linear Quadrate; Logit 

Missing Data Mechanism Missing Data Mechanism Missing Data Mechanism 


Figure 12: MSE computed from 100 simulations utilizing 3 different missing data mecha¬ 
nisms. 

are observed. In the worst case scenario the average MSE for a one-sided Protein from a 
logit missing data model was 3.87. With a range of fold changes in the data from roughly 
-10 to 10 an MSE of 3.87 is highly encouraging as it suggests that even with misspecihca- 
tion the M5 model provides a legitimate way to detect one-sided proteins with large fold 
changes. 

5 Conclusion 

We have identihed the two fundamental statistical features of mass spectrometry pro- 
teomics as matched pairs data and non-ignorable missingness. Of the two features, ignoring 
matched pairs appears to be far more detrimental than ignoring the missing data bias. Not 
only is the average intensity across peptides difficult to interpret, but the simple method 
of taking the median ratio greatly outperforms methods based on average intensities in 
terms of mean squared error. In turn, relative to the method of medians, our M5 model is 
capable of improving both the depth and accuracy of the mass spectrometry experiment. 
To the best of our knowledge, this is the hrst model that provides reliable estimation of 
one-sided proteins and our experiments and sensitivity analysis suggests that these esti¬ 
mates could be valuable even when the missing data mechanism has been misspecihed. A 
great deal of work can be done to extend this model. Here we have only demonstrated the 
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importance of peptide level matching with a model that compares two samples from either 
a SILAC or LFQ experiment. A different missing data mechanism would be required to 
£t iTRAQ data and extensions for multiple sample comparisons and sample fractionation 
are not immediately obvious. In whatever way these more complicated problems might 
be solved, the lessons from this study should be incorporated into the solution. The joint 
presence of matched pairs data and non-ignorable missingness can greatly inflate the error 
in estimation procedures that do not explicitly account for both features. 


A APPENDIX: Proofs 
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SUPPLEMENTARY MATERIAL 

R-code for implementing the M5 model R code for performing M5 and the other 
methods described in the article. 

Tumor data set: Data set used in the sensitivity analysis (.csv hie) 
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